Rule Your Data with The Link King© (a SAS/AF® application for record linkage and unduplication)
Abstract
Administrative datasets containing client identifying information (names, birthdates, SSNs) are often used for research and evaluation projects. These projects often require linking two or more independently maintained client rosters in order to track service utilization across different systems. Unfortunately, a given client may be represented with slightly different identifying information both within and across administrative datasets. Discrepancies arise for a variety of reasons, including:

• Use of nicknames
• Hyphenated names
• Misspelled names
• Transposed SSN digits
• Transposed date fields

Failure to identify and appropriately deal with this problem may lead to incomplete linking of client records and, ultimately, introduce unnecessary error into the research or evaluation project.

This paper introduces The Link King, a SAS/AF application for use in the linkage and unduplication of administrative datasets. The Link King features a data importing and formatting wizard, artificial intelligence to ensure appropriate linking protocols are used, a powerful interface for manual review of "uncertain" linkages, an ability to generate random samples of links for validation, and easy "point-and-click" editing of the final roster of consolidated records. Visit www.the-link-king.com for more information about this public domain software or to download The Link King.

RECORD LINKAGE AND CONSOLIDATION ALGORITHMS

There are two approaches to the linkage and unduplication of client identifiers in administrative datasets: deterministic linking and probabilistic linking.

Probabilistic linking is accomplished through the application of sophisticated statistical analysis. Ultimately, a formula is derived that generates a score for each record pair, along with cut points that separate “definite” matches, “possible” matches, and “non-matches”. The formula incorporates weights specific to each of the data elements and scaling factors for many of the data elements. The weights reflect the relative importance of specific data elements in predicting a match. The scaling factors adjust the weights for a given record pair based on the “rarity” of the data value. For example, the scaling factor for the last name “Freud” would be much larger than that for the last name “Smith”. The probabilistic algorithms used by The Link King were developed by MEDSTAT for the Substance Abuse and Mental Health Services Administration’s (SAMHSA) Integrated Database Project.

Deterministic linking is accomplished by establishing specific criteria about which combination of data elements needs to “match”, and how good each “match” must be, for the link to be accepted as valid. For example, one criterion for considering two client records a “match” might be that all of the following conditions must be met:

• First Names: Must have an Approximate String Match Algorithm score of .75 or higher
• Last Names: Must have an Approximate String Match Algorithm score of .75 or higher

1 The Technical Monograph and original SAS program code are available for download at www.the-link-king.com

SUGI 30, Applications Development
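As a rough illustration of the two approaches described above, the following Python sketch contrasts a deterministic rule with a weighted probabilistic score. It is not The Link King's SAS code. The `similarity` function uses Python's `difflib` as a stand-in for the Approximate String Match Algorithm, and the field names, weights, scaling factors, and cut points are all hypothetical values chosen only for the example. Only the two name conditions that appear in the text are implemented.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Stand-in approximate string match score in [0, 1].

    Assumption: difflib's ratio replaces the (unspecified) algorithm
    used by The Link King.
    """
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def deterministic_match(rec_a: dict, rec_b: dict, threshold: float = 0.75) -> bool:
    """Accept the link only if BOTH name fields score at or above the threshold."""
    return (
        similarity(rec_a["first"], rec_b["first"]) >= threshold
        and similarity(rec_a["last"], rec_b["last"]) >= threshold
    )


# Hypothetical weights: relative importance of each data element.
WEIGHTS = {"first": 2.0, "last": 3.0, "dob": 4.0}

# Hypothetical scaling factors: rare values count for more than common ones.
SCALING = {"last": {"freud": 2.0, "smith": 0.5}}


def probabilistic_score(rec_a: dict, rec_b: dict) -> float:
    """Sum weight * scaling factor over the fields on which the records agree."""
    score = 0.0
    for field, weight in WEIGHTS.items():
        a, b = rec_a[field].lower(), rec_b[field].lower()
        if a == b:
            score += weight * SCALING.get(field, {}).get(a, 1.0)
    return score


def classify(score: float, definite: float = 8.0, possible: float = 5.0) -> str:
    """Apply hypothetical cut points to a pair score."""
    if score >= definite:
        return "definite"
    if score >= possible:
        return "possible"
    return "non-match"


if __name__ == "__main__":
    a = {"first": "Jon", "last": "Smith", "dob": "1970-01-01"}
    b = {"first": "John", "last": "Smyth", "dob": "1970-01-01"}
    print(deterministic_match(a, b))
    print(classify(probabilistic_score(a, b)))
```

In this sketch the deterministic rule tolerates small spelling differences because it compares similarity scores against a threshold, while the probabilistic score only credits exact agreement on a field. A fuller implementation would typically give partial credit for near matches as well.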
Related resources
Record linkage software in the public domain: a comparison of Link Plus, The Link King, and a 'basic' deterministic algorithm
The study objective was to compare the accuracy of a deterministic record linkage algorithm and two public domain software applications for record linkage (The Link King and Link Plus). The three algorithms were used to unduplicate an administrative database containing personal identifiers for over 500,000 clients. Subsequently, a random sample of linked records was submitted to four research s...
Probabilistic Linkage of Persian Record with Missing Data
Extended Abstract. When the comprehensive information about a topic is scattered among two or more data sets, using only one of those data sets would lead to information loss available in other data sets. Hence, it is necessary to integrate scattered information to a comprehensive unique data set. On the other hand, sometimes we are interested in recognition of duplications in a data set. The i...
Hierarchical Bayesian Record Linkage Theory
In record linkage, or exact file matching, one compares two or more files on a single population for purposes of unduplication or production of an enhanced, merged database. Record linkage has many applications, including in population enumeration efforts, to create databases for epidemiological investigations, and to improve survey sample frames. Latent class and mixture models have been used ...
Record Linkage for Genealogical Databases
In this paper we describe past experience and outline current directions in performing record linkage over large genealogical databases. 1. INTRODUCTION AND MOTIVATION Record linkage is the problem of identifying multiple records that refer to the same real-world entity. In genealogical databases, it is the problem of identifying when individuals situated in different pedigrees refer to the sam...
Evaluating Multipath TCP Resilience against Link Failures
Standard TCP is the de facto reliable transfer protocol for the Internet. It is designed to establish a reliable connection using only a single network interface. However, standard TCP with single interfacing performs poorly due to intermittent node connectivity. This requires the re-establishment of connections as the IP addresses change. Multi-path TCP (MPTCP) has emerged to utilize multiple ...